ABSTRACT

In many garbage collected systems, the mutator performs a write 
barrier for every pointer update. Using generational garbage col- 
lectors, we study in depth three code placement options for remembered- 
set write barriers: inlined, out-of-line, and partially inlined (fast 
path inlined, slow path out-of-line). The fast path determines if 
the collector needs to remember the pointer update. The slow path 
records the pointer in a list when necessary. Efficient implemen- 
tations minimize the instructions on the fast path, and record few 
pointers (from 0.16 to 3% of pointer stores in our benchmarks). We 
find the mutator performs best with a partially inlined barrier, by a 
modest 1.5% on average over full inlining.

We also study the compilation cost of write-barrier code place- 
ment. We find that partial inlining reduces the compilation cost 
by 20 to 25% compared to full inlining. In the context of just-in- 
time compilation, the application is exposed to compiler activity. 
Regardless of the level of compiler activity, partial inlining consis- 
tently gives a total running time performance advantage over full 
inlining on the SPEC JVM98 benchmarks. When the compiler op- 
timizes all application methods on demand and compiler load is 
highest, partial inlining improves total performance on average by 
10.2%, and up to 18.5%. 
D.3.4 [Programming Languages]: Processors - Memory management
(garbage collection)
1. Introduction

Many garbage collectors remember pointer stores. To avoid col- 
lecting the entire heap, garbage collectors divide the heap into re- 
gions and track the pointers between them. For example in a gen- 
erational copying collector, the collector scavenges the nursery in- 
dependently of higher generations and avoids scanning older gen- 
erations by conservatively assuming that any remembered point- 
ers into the nursery are live. Collector and mutator performance 
depend on the frequency of pointer stores, the number of stores 
remembered, and the benefits from scavenging regions independently.
These tradeoffs almost always improve garbage collector
and total performance [9, 12, 27, 32, 31] (see Section 5.4). 

The write-barrier code sequence determines whether a pointer 
store needs to be remembered, and if so, remembers it. The fast 
path of a conditional write barrier determines if the pointer update 
should be remembered (i.e., it crosses independently collected re- 
gions and the source will be collected before the target). The fast 
path is typically short (3 to 5 instructions) and consists of bit op- 
erations, comparisons, and perhaps loads. The slow path remem- 
bers the pointer update. A remembered set scheme usually puts 
the source in a list. The collector then processes the list at the be- 
ginning of a collection [32]. Efficient collector organizations mini- 
mize the number of remembered pointer stores [3, 8, 23, 31]. Other 
schemes, such as card marking [29, 33], unconditionally set a bit 
in a bit vector to mark a region of memory containing the source 
pointer, and scan for pointers into the increment being collected at 
collection time. Card marking trades off scanning time for a sim- 
pler unconditional barrier. 

Previous research has explored implementations of write barri- 
ers, remembered sets, card marking, and hybrids [7, 23, 22]. Fitzger- 
ald and Tarditi [20] suggest putting the cold path out-of-line. How- 
ever, no previous work measures the impact of this choice. This 
paper investigates the impact of the write barrier on the applica- 
tion code quality, and on the compilation cost in Jikes RVM [1,2] 
with a variety of garbage collectors we developed. We compare 
no write barrier, a completely inlined write barrier, an out-of-line 
write barrier, and a partially inlined barrier (fast path inlined, slow 
path out-of-line). We implement the out-of-line cases with a direct 
procedure call. We use a variety of copying collectors, and also 
compare with a semi-space copying collector which has no write 
barrier. 
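The difference between these placements can be pictured at source level. The following is a minimal sketch, not the paper's code: the class `BarrierPlacement`, its method names, and the `NURSERY_START` boundary are hypothetical, and the boundary test merely stands in for a real fast path. Java source cannot itself force an inlining decision, so the comments only indicate which part a compiler would be asked to inline in the partially inlined configuration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a conditional write barrier split into a fast
// path and a slow path. All names and constants here are hypothetical.
class BarrierPlacement {
    static final long NURSERY_START = 0x4000_0000L; // assumed boundary
    static final List<Long> remembered = new ArrayList<>();

    // Fast path: a short test, the candidate for inlining at every store.
    static boolean needsRemembering(long sourceAddr, long targetAddr) {
        return sourceAddr < NURSERY_START && targetAddr >= NURSERY_START;
    }

    // Slow path: rarely executed, kept in its own method (out-of-line).
    static void remember(long sourceAddr) {
        remembered.add(sourceAddr);
    }

    // Partially inlined shape: the test is done in place, and a call is
    // made only when the store has to be remembered.
    static void writeBarrier(long sourceAddr, long targetAddr) {
        if (needsRemembering(sourceAddr, targetAddr)) {
            remember(sourceAddr);
        }
    }
}
```

In the fully inlined configuration the body of `remember` would also be expanded at each store site, and in the out-of-line configuration the whole of `writeBarrier` would be a call.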

Our results first confirm that the slow path is rarely taken (be- 
tween 0.15 and 3%) for the collectors and SPEC JVM98 bench- 
marks we examine, which is consistent with previous languages 
and systems [26, 22, 30]. Compared with no write-barrier, inlin- 
ing increases the application code size by on average 81%, partial 
inlining by 33%, and out-of-line by 21%. Partial inlining how- 
ever provides the fastest executing application code by a modest 
1.5%. Jikes RVM thus integrates and optimizes the fast path instructions
well, and the overhead of the direct procedure call on the
infrequently taken slow path is minimal. The inlined write barrier 
suffers because the slow path bloats the code size, and since it is 
rarely taken, yields no performance benefits. 

The compilation cost of the write barrier has two components in 
the Jikes RVM. (1) The obvious component - the compiler gen- 
erates and optimizes more code when the write barrier is inlined 
than when it is out-of-line or partially inlined. (2) The compiler 
itself executes the write barrier as it runs in the Jikes RVM. Be- 
cause the Jikes RVM compiler is written in Java and runs along 
with the application in the JVM, it uses the same write barrier as 
the application as it compiles the application. Both costs slow down 
the compiler when it inlines or partially inlines the write barrier 
compared with an out-of-line write barrier. We show that partial 
inlining can overcome this degradation with application time im- 
provements. An inlined write barrier slows down the application 
and a just-in-time compiler twice; by 10.2% on average, and up to 
18.5% compared with partial inlining. Even an out-of-line barrier is 
competitive with inlining (1.4% worse to 15.5% better) when com- 
pilation load is a high percent of total time. Although the overall 
result may now seem predictable to the informed reader, we were 
startled by the magnitude of the differences. In summary, partial inlining
is the best choice for a conditional write barrier, and inlining
is especially problematic for a just-in-time compiler. 

This paper is organized as follows. We first describe related work 
on incremental collectors and their write barriers, including poli- 
cies and mechanisms. We then overview the generational collec- 
tors, write barriers, and methodology we use. Our results section 
demonstrates the static and dynamic application and compilation 
costs of the different code placement choices. These results show 
the partially inlined barrier is the best choice, and in some circum- 
stances, dramatically improves performance over inlining or out- 
of-line barriers. 

2. Related Work 

In this section, we discuss related work on write barrier function 
and designs, and general compiler inlining. To our knowledge, no 
one has studied the impact of write barrier inlining strategies be- 
fore. Write barriers are required in two distinct contexts: incremen- 
tal garbage collection and reference counting garbage collection. 

Incremental collectors depend on write barriers to record point- 
ers into independently collected regions of memory called incre- 
ments. By tracking all pointers into an increment, the increment 
can be safely collected by making the conservative assumption that 
the sources of all incoming pointers are live. If the number of 
incoming pointers is suitably low, incremental collection can be 
very efficient. Incremental collection is the basis for a large num- 
ber of garbage collectors including generational [27, 32, 3], older- 
first [31], Beltway [8], and mature object space collectors [25]. 
Jones and Lins [26] describe many more incremental algorithms. 

A write barrier for incremental collection can usually be charac- 
terized in terms of three implementation choices. 1) A mechanism 
for determining whether to remember a pointer update. 2) A de- 
sign decision as to what should be remembered. 3) A mechanism 
for how it should be remembered. The literature records a large 
number of alternatives in this space [26]. Two broad approaches 
are widely used: remembered sets and card marking. 

2.1 Remembered Sets

Remembered sets typically remember either the source object or 
slot (pointer field) [26], both of which we study here. Ungar was 
the first to suggest the object remembering barrier [32], and the
Jikes RVM generational garbage collectors [2] also remember objects.
They use a bit in the source object to avoid duplicates, and to
avoid remembering source objects that reside in the nursery.

Collectors may remember the exact slot instead of the object containing
the pointer [3, 24, 23, 30]. Stefanović et al. developed a very
fast address order write barrier that exploits an address order or- 
ganization of collection increments in the older-first collector. The 
write barrier we use is similar, and depends on generations being 
organized within major virtual memory alignment boundaries to 
avoid explicit generational bounds checking. This structure leads 
to a very fast barrier (see Figure 1a) and c)) that does not require
loading explicit generation bounds for comparison with the source 
and target pointers. 

The literature also reports a number of remembered set imple- 
mentations. Hudson and Diwan use a sequential store buffer to 
remember slots [24, 23]. They use virtual memory protection to 
detect buffer overflow. The Jikes RVM collectors use a similar 
structure for storing remembered objects, although it uses an explicit
bounds check to detect buffer overflow. Our collectors remember
slots in power-of-two aligned buffers, and use the power-of-two
alignment to detect buffer overflow without an explicit buffer
bounds pointer.

2.2 Card Marking 

Card marking uses a table to remember fixed size regions of memory
(cards) as pointer sources [29, 33]. The write barrier marks
cards when necessary, and the collector treats memory regions cor- 
responding to marked cards as roots, scanning them for pointers. 
The collector clears the card table at the end of a collection. Sev- 
eral papers consider the efficiency of card marking schemes [33, 
10, 21, 23], as well as a hybrid of remembered sets and card mark- 
ing [22]. Hosking et al. compare remembered sets and cards in an 
interpreted Smalltalk system [23, 22] and find their performance is 
similar. Our study is of limited relevance to card marking because 
most card marking barriers use a very short and unconditional code 
sequence. 
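For contrast with a conditional barrier, a card-marking barrier can be sketched as follows. This is an illustration under stated assumptions rather than any cited system's implementation: the class `CardTableSketch`, the 512-byte card size (`CARD_SHIFT = 9`), and `HEAP_BASE` are all hypothetical, and addresses are simulated with `long` values.

```java
import java.util.Arrays;

// Illustrative card-marking barrier over a simulated address space.
// The card size and heap base are assumptions made for this sketch.
class CardTableSketch {
    static final int CARD_SHIFT = 9;             // assumed 512-byte cards
    static final long HEAP_BASE = 0x1000_0000L;  // hypothetical heap start
    final byte[] cards;

    CardTableSketch(long heapBytes) {
        long cardCount = (heapBytes + (1L << CARD_SHIFT) - 1) >>> CARD_SHIFT;
        cards = new byte[(int) cardCount];
    }

    // Unconditional barrier: mark the card that contains the updated slot.
    void writeBarrier(long slotAddr) {
        cards[(int) ((slotAddr - HEAP_BASE) >>> CARD_SHIFT)] = 1;
    }

    // Collector side: count the marked cards; a real collector would scan
    // the memory behind each one for pointers into the collected region.
    int markedCards() {
        int n = 0;
        for (byte c : cards) n += c;
        return n;
    }

    // The table is cleared at the end of a collection.
    void clear() {
        Arrays.fill(cards, (byte) 0);
    }
}
```

The barrier has no branch and no rarely executed portion, which is why the fast-path and slow-path split studied here has little to offer it.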

2.3 Reference Counting 

Reference counting algorithms rely on a write barrier to update ref- 
erence counts at each pointer store. A classic [13] reference count- 
ing algorithm might employ a conditional write barrier (reclaim 
the object if the count is zero). However the more widely used 
deferred reference counting [18] approach depends on an uncon- 
ditional write barrier. It remembers pointer stores unconditionally 
and processes them from time to time. Our key findings depend on 
the conditionality of the incremental collector's write barrier, and 
the fact that much of the write barrier is rarely executed. Therefore 
our results are not likely to apply directly to reference counting 
garbage collector write barriers. 

2.4 Inlining 

For a long time, compilers have used inlining to improve code per- 
formance, and have for the most part used heuristics based on code 
size and/or profiling to limit the code bloat effects of inlining [4, 
6, 28]. Cooper et al. find that inlining can degrade highly opti- 
mizing compilers, and expose non-linear compiler algorithms [16, 
15]. Cooper et al. [14] and Dean and Chambers [17] show that 
not all inlining is equal, and its judicious application improves performance.
Both suggest frequency as a criterion for their automatic
procedure inlining. We use a Jikes RVM compiler pragma to per- 
form partial inlining and isolate the infrequently executed instruc- 
tions. Because the write barrier is so prolific, this choice has a large 
impact. Our results suggest partial inlining for other prolific code 
sequences with hot and cold paths, such as the allocation sequence,
should be profitable. It might also be worth investigating compiler 
partial inlining using branch-frequency profile feedback. 

3. Collectors and Barriers 

This section briefly presents the generational garbage collectors 
that we use in our study. These collectors require a write barrier 
and this section describes the two styles we implement. 

3.1 Garbage Collectors 

We use an Appel-style generational collector [3] as the basis of our 
study because, to our knowledge, it is the best performing gener- 
ational collector [8]. We also gather write barrier statistics for a 
fixed-size two generational collector to explore a wider range of 
collector behaviors. We compare with a non-generational semi- 
space collector (which does not require a write barrier) to reveal 
the overall cost of using a write barrier. We now briefly describe 
these collectors. 

3.1.1 Non-Generational Copying Collection

The semi-space copying collector [26] simply divides the heap into 
two equal-sized semi-spaces, to-space and from-space. It always
uses to-space for allocation. When allocation exhausts the to-space, 
the collector is invoked. It first flips the spaces, to-space becomes 
from-space, and from-space becomes to-space. The collector then 
identifies all live objects in from-space by tracing from the roots 
(globals, stack variables, etc.). It copies each live object into to- 
space. Each time the collector encounters a pointer to an object 
in from-space, it replaces the pointer with a pointer to the object's
new location in to-space. It then reclaims from-space en masse. 
The mutator resumes allocating into to-space until it is exhausted, 
at which point the collection cycle resumes. 
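The flip-and-copy cycle described above can be sketched on a toy heap. This is a minimal illustration, not the GCTk collector: the class `SemiSpaceSketch`, the `int[]` heap, and the object layout (a field-count header followed by pointer fields, with index 0 as null and a negative header as a forwarding pointer) are assumptions chosen to keep the example small.

```java
// Toy semi-space copying collector over an int[] heap (Cheney-style scan).
// Assumed layout: [fieldCount, field0, field1, ...]; fields are heap
// indices of other objects and 0 is null. A negative header is a
// forwarding pointer holding -(new address).
class SemiSpaceSketch {
    final int[] heap;
    final int half;
    int toBase, fromBase, free;   // free is the bump pointer in to-space

    SemiSpaceSketch(int words) {
        heap = new int[words + 1];  // index 0 is reserved for null
        half = words / 2;
        toBase = 1;
        fromBase = 1 + half;
        free = toBase;
    }

    // Bump allocation in to-space; returns 0 when to-space is exhausted.
    int alloc(int fields) {
        if (free + 1 + fields > toBase + half) return 0;
        int obj = free;
        heap[obj] = fields;
        for (int i = 1; i <= fields; i++) heap[obj + i] = 0;
        free += 1 + fields;
        return obj;
    }

    // Flip the spaces, copy everything reachable from the roots into the
    // new to-space, and update the roots in place.
    void collect(int[] roots) {
        int t = toBase; toBase = fromBase; fromBase = t;
        free = toBase;
        int scan = toBase;
        for (int i = 0; i < roots.length; i++) roots[i] = forward(roots[i]);
        while (scan < free) {               // scan copied objects in order
            int fields = heap[scan];
            for (int i = 1; i <= fields; i++) {
                heap[scan + i] = forward(heap[scan + i]);
            }
            scan += 1 + fields;
        }
    }

    // Copy an object into to-space unless it was already copied, and
    // return its new address.
    private int forward(int obj) {
        if (obj == 0) return 0;
        if (heap[obj] < 0) return -heap[obj];   // already forwarded
        int fields = heap[obj];
        int copy = free;
        System.arraycopy(heap, obj, heap, copy, 1 + fields);
        free += 1 + fields;
        heap[obj] = -copy;                      // leave forwarding pointer
        return copy;
    }
}
```

After `collect` returns, nothing in from-space is referenced, so the whole space is reclaimed simply by reusing it at the next flip.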

3.1.2 Generational Copying Collection 
Because researchers observed that 'most objects die young', they 
constructed generational collectors to collect the youngest objects 
more frequently and thus improve upon the simple semi-space col- 
lector. This observation is widely known as the weak generational 
hypothesis [32]. A generational collector extends the semi-space 
collector by allocating into a younger generation (or nursery) and 
collecting it frequently. It copies those objects that survive the nurs- 
ery into the to-space of the older generation. Filling the to-space of 
the older generation triggers a full heap collection. This collection 
considers the entire heap and identifies live objects by tracing from 
roots just as the semi-space collector does. 

In practice most objects do die young, so expensive full heap col- 
lections are infrequent, giving generational collection a significant 
performance advantage over simple semi-space collection. How- 
ever, this advantage depends on collecting the nursery indepen- 
dently, i.e., without tracing the full heap to identify live nursery 
objects. To achieve this goal, a generational collector remembers 
all pointers into the nursery from the older generation. 1 When it 
collects the nursery, it conservatively assumes all pointers into the 
nursery are live. As we discussed in Section 2, there are a number 
of approaches to remembering such pointers. All of them depend 
on a write barrier to trap pointer stores and, when necessary, re- 
member the source of the pointer store. 



1 Not all generational collectors use a write barrier; some go to the
expense of tracing the higher generation as part of each nursery
collection, and in some circumstances a read barrier is an effective
implementation choice [11].



3.1.3 The Appel-style Generational Collector
Generational collectors are often implemented with a fixed fraction 
of the heap reserved for use by the nursery [26]. Appel [3] de- 
scribes an alternative implementation which allows the nursery to 
consume all the space not used by the higher generation's to-space 
and from-space. Thus initially, when the higher generation's to- 
space is empty, the nursery occupies half of the heap. 2 The nurs- 
ery shrinks as to-space of the older generation grows. When the 
older generation is full, the collector performs a full heap collec- 
tion, which will typically shrink the higher generation's to-space 
and thus expand the nursery. By always deferring collection un- 
til all of the available heap space is consumed, the flexible nursery 
approach makes efficient use of space, and collects the higher generation
less frequently. We find this organization performs quite a
bit better than a fixed-size nursery collector [8].
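The space accounting behind the flexible nursery can be written down in a few lines. The sketch below is a simplified model read off the description above, assuming that half of the heap is held in reserve for copying and that the nursery takes whatever the older generation's to-space leaves of the other half; the class and method names are hypothetical and this is not the GCTk triggering rule itself.

```java
// Simplified space accounting for a flexible (Appel-style) nursery.
// Assumption: half the heap is a copy reserve; the nursery and the older
// generation's to-space share the other half.
class AppelNurserySketch {
    // Space currently available to the nursery.
    static long nurseryBytes(long heapBytes, long olderToSpaceBytes) {
        return Math.max(0L, heapBytes / 2 - olderToSpaceBytes);
    }

    // A full heap collection is due once the older generation has filled
    // its half and no room is left for a nursery.
    static boolean fullCollectionDue(long heapBytes, long olderToSpaceBytes) {
        return nurseryBytes(heapBytes, olderToSpaceBytes) == 0;
    }
}
```

Under this model a 64MB heap starts with a 32MB nursery, which shrinks to 22MB once the older generation holds 10MB.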

3.2 Write Barriers 

We consider an Appel generational collector with two generations, 
a nursery and an older generation. We use remembered sets, which 
are lists of remembered pointer sources, to track pointers from the 
older generation into the nursery. The write barrier produces re- 
membered set entries, when necessary. The collector consumes a 
remembered set and treats each entry as a root during incremental 
collection (i.e., nursery collection). 

We examine the two common write barriers: a slot remembering 
write barrier, which remembers the slot (object field) containing 
the source pointer, and an object remembering write barrier, which 
remembers the object containing the source pointer [26]. Both are 
widely used [32, 3, 26, 31, 8, 2]. Each barrier makes slightly different
tradeoffs. The slot barrier is likely to remember more pointers.
With an object barrier, the collector must scan the source objects 
during a collection. The point of this paper is not to explore the 
relative merits of either approach, but by examining both we add a 
degree of generality to our results. 

There are two distinct components to the barrier implementa- 
tion, a frequently executed fast path, which tests whether a pointer 
should be remembered, and an infrequently executed slow path, 
which stores the pointer into a remembered set when necessary. We 
now describe our implementation of the slot and object barriers. 

3.2.1 Slot Remembering Write Barrier
The slot remembering write barrier remembers the addresses of 
pointers into the nursery. It tests each pointer store and if the source 
is outside the nursery and the target is within the nursery, it remem- 
bers the address of the source in a remembered set. The collector 
consumes the remembered set at collection time; it examines each 
entry to see if the remembered address still contains a pointer into 
the nursery. (The program may have overwritten the pointer with 
an uninteresting value after it was recorded in the remembered set.) 
If so, the collector marks the pointed-to object live. Otherwise, it 
ignores the pointer. 
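The interplay between recording slots and re-checking them at collection time can be sketched on a toy memory. This is an illustration only: the class `SlotRemsetSketch` is hypothetical, memory is an `int[]` whose indices play the role of addresses, and a single index `nurseryStart` stands in for the nursery boundary.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a slot-remembering barrier and its remembered set.
// Memory is an int[]; an "address" is an index, and every index at or
// above nurseryStart is treated as lying in the nursery.
class SlotRemsetSketch {
    final int[] memory;
    final int nurseryStart;
    final List<Integer> remset = new ArrayList<>();

    SlotRemsetSketch(int words, int nurseryStart) {
        this.memory = new int[words];
        this.nurseryStart = nurseryStart;
    }

    boolean inNursery(int addr) {
        return addr >= nurseryStart;
    }

    // Pointer store with the barrier: remember the slot when the source
    // is outside the nursery and the target is inside it.
    void store(int slot, int target) {
        memory[slot] = target;
        if (!inNursery(slot) && inNursery(target)) remset.add(slot);
    }

    // Start of a nursery collection: re-examine each remembered slot and
    // return the nursery objects that are still referenced from outside.
    List<Integer> nurseryRoots() {
        List<Integer> roots = new ArrayList<>();
        for (int slot : remset) {
            int value = memory[slot];
            if (inNursery(value)) roots.add(value);
            // Otherwise the slot was overwritten after it was recorded,
            // so the entry is ignored.
        }
        remset.clear();
        return roots;
    }
}
```

A slot that is recorded and later overwritten with a pointer to an old object stays in the remembered set but contributes no root, which is the case the parenthetical remark above describes.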

Figure 1a) illustrates the Java code for our fast path implementation.
By locating the nursery and older generation on different
sides of a major virtual memory alignment boundary (2^K), we are
able to apply Stefanović et al.'s very cheap address order write barrier
[31] to generational collection. We put the nursery in high
memory, and older generations in successively lower memory regions.
We then simply mask the lower K bits in the target and if

2 Generational collectors must reserve half of the heap for the 
higher generation's from-space to accommodate the worst case sur- 
vival rate in a full heap collection. Thus when the higher genera- 
tion to-space is empty, half of the total heap space is available to 
the nursery in an Appel-style collector. 
1 public static final void writeBarrier(ADDRESS source, ADDRESS target) {
2   if (source < ((target >>> HEAP_K) << HEAP_K)) {
3     GCTk_WriteBufferSlot.insert(source);
4   }
5 }

Figure 1: The slot remembering write barrier used by the GCTk generational collectors.



1 public static final void writeBarrier(Object source) {
2   int statusWord = VM_Magic.getIntAtOffset(source, OBJECT_STATUS_OFFSET);
3   if ((statusWord & OBJECT_BARRIER_MASK) != 0) {
4     GCTk_WriteBufferObject.insert(source, statusWord);
5   }
6 }

Figure 2: The object remembering write barrier used by the GCTk generational collectors.
the source is less than the shifted target, we remember it (line 2 in
Figure 1a)). We use two shifts to perform the mask, which avoids a two-step
direct mode AND. In fact the two shifts are folded into a single
PowerPC mask operation by the Jikes RVM optimizing compiler.
This barrier generalizes to N generations, as long as each generation
is contained within a 2^K-aligned virtual memory region, and
the generation ordering is preserved in the memory organization.
The PowerPC instruction sequence for the fast path appears in Figure
1c), instructions 1 through 3.
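A worked example may make the address-order test concrete. The sketch below assumes a hypothetical `HEAP_K` of 28 and two hypothetical region bases, with addresses modelled as `long` values; only the comparison itself is taken from the barrier described above.

```java
// Worked example of the address-order test. HEAP_K and the region bases
// are assumptions chosen for illustration.
class AddressOrderSketch {
    static final int HEAP_K = 28;   // assumed: 2^28-byte aligned regions

    // True when the store must be remembered: the source lies below the
    // 2^HEAP_K-aligned region that contains the target.
    static boolean mustRemember(long source, long target) {
        return source < ((target >>> HEAP_K) << HEAP_K);
    }

    public static void main(String[] args) {
        long older   = 0x3000_0000L;  // older generation region (lower)
        long nursery = 0x4000_0000L;  // nursery region (higher)
        System.out.println(mustRemember(older + 0x100, nursery + 0x200));   // true:  old -> nursery
        System.out.println(mustRemember(nursery + 0x100, nursery + 0x200)); // false: nursery -> nursery
        System.out.println(mustRemember(older + 0x100, older + 0x200));     // false: old -> old
        System.out.println(mustRemember(nursery + 0x100, older + 0x200));   // false: nursery -> old
    }
}
```

A null target (address 0) masks to 0, so the comparison fails and null stores are never remembered, without a separate null check.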

When we remember a pointer, we put it in a write buffer, as
illustrated in Figure 1b). The write buffer is implemented as a simple
sequential store buffer [24, 23]. We implement the buffer using
a chain of power-of-two sized chunks (write_buffer_buf_size
= 2^n). We exploit the power-of-two alignment of each chunk to
perform a cheap bounds test (Figure 1b), line 7). The Jikes RVM
optimizing compiler produces tight code for the slow path as shown
in Figure 1c), instructions 4 through 20. It further optimizes this
code in context after inlining. The slot barrier is the default for the
generational collectors in our garbage collection toolkit (GCTk). 3
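The idea behind the alignment-based bounds test can be sketched with indices in place of raw addresses. This is not the GCTk write buffer code: the class `ChunkedStoreBufferSketch`, the chunk size of 16 entries, and the use of a Java list to chain chunks are assumptions made so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Sequential store buffer built from power-of-two sized chunks. The
// chunk size and the list-based chaining are assumptions of this sketch.
class ChunkedStoreBufferSketch {
    static final int LOG_CHUNK = 4;           // assumed: 16 entries per chunk
    static final int CHUNK = 1 << LOG_CHUNK;
    static final int MASK = CHUNK - 1;

    final List<long[]> chunks = new ArrayList<>();
    long[] current = new long[CHUNK];
    int cursor = 0;                           // total entries inserted so far

    ChunkedStoreBufferSketch() {
        chunks.add(current);
    }

    void insert(long value) {
        current[cursor & MASK] = value;
        cursor++;
        // Cheap bounds test: the cursor sits on a chunk boundary exactly
        // when its low LOG_CHUNK bits are zero, so no separate
        // end-of-buffer value has to be loaded and compared.
        if ((cursor & MASK) == 0) {
            current = new long[CHUNK];
            chunks.add(current);
        }
    }

    long get(int i) {
        return chunks.get(i >>> LOG_CHUNK)[i & MASK];
    }

    int size() {
        return cursor;
    }
}
```

With real addresses the same test is applied to the buffer pointer itself, which works because each chunk starts on a boundary that is a multiple of its size.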

3.2.2 Object Remembering Write Barrier 
The generational collectors that come with Jikes RVM implement 
an object remembering barrier [32]. This barrier tests each pointer 
store, and remembers the source object if necessary. At collection 
time, the collector treats each remembered object as live and scans 
it for pointers into the nursery. It marks live any pointed-to nurs- 
ery object. As with the slot remembering barrier, this barrier has 
a frequently executed fast path, and an occasionally executed slow 
path. We now describe our implementation of the object remem- 
bering barrier in GCTk. (Our implementation closely follows the 
Jikes RVM implementation. It differs only in the slow path. We 
use the fast power-of-two bounds check described above for the 
write buffer, and the Jikes RVM uses a comparison with an explicit 
end-of-buffer value, which requires an additional load.) 

Figure 2a) illustrates the Java code for the fast path implementa- 
tion. This code remembers a source object if the object_barrier 
bit is set in the source object's status word. Once it remembers 
the object, it clears the bit. Then it will not remember any subse- 
quent stores for an object with a clear object_barrier bit. Note 
that because this bit is clear in any newly allocated object, the bar- 
rier automatically ensures that it never remembers nursery objects. 
When the collector copies an object into the higher generation, it 
sets the bit to ensure the barrier will remember subsequent stores 
to it. When it processes the remembered set, it resets each object's 
object_barrier bit (which was cleared during the write barrier).
Note that this barrier is somewhat conservative because it remem- 
bers all old generation objects into which a pointer is stored regard- 
less of whether the pointer is into the nursery. It need not store older 
to older generation pointers. The PowerPC instruction sequence for 
the object remembering barrier fast path appears as lines 1 through 
to 3 in Figure 2c). 

The slow path for the object remembering barrier is identical to 
that for the slot remembering barrier except that it first clears the 
object_barrier bit in the source object. The Java code for this 
barrier is illustrated in Figure 2b), and the corresponding instruc- 
tion sequence appears in Figure 2c). 
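The life cycle of the barrier bit can be followed in a small model. The sketch below is illustrative: the class `ObjectBarrierSketch`, its nested `Obj` type with a single pointer field, and the bit value `BARRIER_BIT` are hypothetical, and ordinary Java objects stand in for heap objects with a status word.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of an object-remembering barrier driven by a status bit.
// The object shape and the bit position are assumptions of this sketch.
class ObjectBarrierSketch {
    static final int BARRIER_BIT = 1;   // hypothetical bit in the status word

    static class Obj {
        int status;    // bit clear in every newly allocated object
        Obj field;     // a single pointer field, for illustration
    }

    final List<Obj> remset = new ArrayList<>();

    Obj allocate() {
        return new Obj();
    }

    // Copying an object into the older generation sets its bit.
    void promote(Obj o) {
        o.status |= BARRIER_BIT;
    }

    // Pointer store with the barrier.
    void store(Obj source, Obj target) {
        source.field = target;
        if ((source.status & BARRIER_BIT) != 0) {   // fast path test
            source.status &= ~BARRIER_BIT;          // slow path: clear the bit
            remset.add(source);                     // and remember the object once
        }
    }

    // Nursery collection: hand back the remembered objects for scanning
    // and set their bits again so later stores are caught.
    List<Obj> drain() {
        List<Obj> scanned = new ArrayList<>(remset);
        for (Obj o : scanned) o.status |= BARRIER_BIT;
        remset.clear();
        return scanned;
    }
}
```

Repeated stores into the same old object add one entry, and stores into nursery objects add none, because the bit is either already cleared or was never set.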

4. Methodology 

We explore the impact of write barrier inlining strategies by con- 
sidering two metrics: code quality and compilation workload. The 

3 GCTk implements a number of collectors and works in the Jikes 
RVM. We designed it to be more general and extendible than the 
existing collectors in the Jikes RVM. 



former measures how the choice of inlining strategy affects the
performance of the compiled code, while the latter measures how
the choice affects the amount of work the compiler must do to
compile each barrier. Code quality is of obvious importance, and in 
a dynamic compilation context, such as Java, compilation workload 
is also important because the application is exposed to the compiler 
through just-in-time compilation. 

We measure code quality by timing the SPEC JVM98 bench- 
marks in Jikes RVM, and measure compiler workload by timing 
the Jikes RVM optimizing compiler building a Jikes RVM boot 
image while running on the Sun HotSpot JVM. We describe this 
experimental environment below. 

4.1 JIT and GC Environment 

We use the Jikes RVM (formerly known as Jalapeno) and GCTk, 
which is a GC toolkit for Jikes RVM which we have recently de- 
veloped. 4 We now overview each of these. 

4.1.1 Jikes RVM

Jikes RVM is a high performance VM written in Java with an ag- 
gressive optimizing compiler [2, 1]. Jikes RVM offers three com- 
piler choices: baseline, a quick non-optimizing compiler for all 
methods; optimizing, an aggressive optimizing compiler for all methods;
and adaptive, which initially uses baseline and adaptively recompiles
hot methods with the optimizing compiler. The adaptive compiler
uses sampling to select optimization candidates, and thus tends
to make slightly different choices for each execution. This non- 
determinism makes the adaptive compiler a difficult platform for 
any detailed study of the optimizing compiler. Since our focus is 
on the behavior of optimizing compilation rather than when to use 
it, we measure the optimizing compiler for our key results. We use 
the adaptive compiler to contextualize our results with a realistic 
indicator of compilation workload. Jikes RVM can be configured 
with two levels of ahead-of-time compilation. A minimal config- 
uration only precompiles those classes essential to bootstrapping 
the VM (which does not include the optimizing compiler). We use 
the configuration which precompiles as much as possible, includ- 
ing key libraries and the optimizing compiler. We also turn off 
assertion checking for our experiments. 5 

4.1.2 GCTk 

GCTk is an efficient and flexible platform for GC experimenta- 
tion that exploits the object-orientation of Java and the VM-in- 
Java property of Jikes RVM. We have implemented a number of 
GC algorithms in GCTk and found their performance to be similar 
to those of existing Jikes RVM GC implementations. The GCTk 
collectors used here (semi-space, fixed-nursery generational, and 
Appel-style generational) are well tuned. Each of the collectors 
shares a common infrastructure. The write barriers share common 
sequential store-buffer code. The Appel-style and fixed-nursery 
generational collectors share all the same code except their collec- 
tion triggering rule. 

4.2 Benchmarks 

In Section 5.2, we measure compiler code quality by timing the 
SPEC JVM98 benchmarks. Table 1 shows key compile time and 
run time characteristics of each of the benchmarks. 

For these results, we compile each benchmark with the Jikes 
RVM optimizing compiler using fully inlined write barriers (which 
is the default behavior). For consistency between executions, we 

4 GCTk is publicly available at http://cs.umass.edu/-gctk. 
5 This build-time configuration is known as Fast. 
                  Compile Time                          Run Time
                  Bytecodes   Instructions   Write      Minimum   Pointer
                  compiled    generated      Barriers   heap      stores
_201_compress     13.5KB      136.8KB         240       19MB      21K
_202_jess         31.1KB      317.1KB         428       11MB      34.0M
_205_raytrace     21.7KB      202.8KB         303       15MB      6.6M
_209_db           15.0KB      154.0KB         236       22MB      30.0M
_213_javac        72.2KB      579.2KB        1929       27MB      25.5M
_222_mpegaudio    28.9KB      197.4KB         344       11MB      <1K
_227_mtrt         21.7KB      202.1KB         303       22MB      8.1M
_228_jack         35.0KB      319.7KB         519       13MB      9.4M

Table 1: Benchmark Characteristics



do not use the Jikes RVM adaptive compiler [5]. The optimizing 
compiler compiles all methods required for the execution of each 
benchmark. The compilation includes the SPEC JVM98 harness 
and any additional libraries required by the benchmark that are not 
precompiled as part of the Jikes RVM boot image. Table 1 indicates 
the volume of Java bytecodes compiled, the volume of instructions 
produced, and the (static) number of write barriers compiled. 

The minimum heap size column in Table 1 indicates the mini- 
mum heap in which the benchmark will run when using the Jikes 
RVM optimizing compiler and the GCTk Appel-style collector (this 
heap size is inclusive of the memory requirements of the optimiz- 
ing compiler compiling the benchmark). The pointer store column 
shows the number of times the write barrier fast path executes in a 
run of the benchmark that does not include any compilation over- 
head. We obtain this statistic by running the benchmarks twice, and 
measuring the second iteration. 

4.3 Experimental Platform 

We use Jikes RVM version 2.0.2, turn off run-time assertion check- 
ing, and use the optimizing compiler with the build-time configu- 
ration which pre-compiles as many classes into the boot image as 
possible. These experiments use a 733MHz Macintosh PowerMac
G4, with 32KB on-chip L1 data and instruction caches, a 256KB
unified L2 cache, 1MB L3 off-chip cache, and 384MB of memory,
running PPC Linux 2.4.10.

We measure compilation time for the various write barriers by 
compiling the Jikes RVM boot image using its optimizing compiler
running on the Sun HotSpot Client VM version 1.3.1. For this
experiment, we use a Dell Precision 340 with a 1.7GHz Intel Pentium
4, with an 8KB L1 data cache, a 12K L1 instruction cache, a
256KB unified L2 on-chip cache, and 512MB of memory running
Linux 2.4.7.

5. Results 

This results section first shows the dichotomy between the slow 
and fast path execution frequencies for slot and object write barri- 
ers which motivates further exploration of write barrier code place- 
ment. We then compare application code quality using the out-of- 
line, partially inlined, and inlined write barriers. These results ex- 
clude both compilation costs and garbage collection time, and show 
that partial inlining yields a small but consistent advantage in exe- 
cution time over inlining, which is slightly better than out-of-line, 
i.e., code quality improves when the slow path is out-of-line. 

We then explore compilation costs. We measure the time taken 
by the Jikes RVM optimizing compiler to compile the methods in 
the boot image. We attain the expected result that inlining is slowest 
to compile, and out-of-line is fastest. The partially inlined barriers 
take between 20% and 25% less time to compile than the fully in- 
lined barrier, with a similar reduction in the number of instructions 
generated. The magnitude of this difference is unexpected. Many 
systems inline the entire write barrier. 



We also examine the combination of collector and write barrier. 
We find, not surprisingly, that despite the runtime overhead of the 
write barrier, the Appel-style generational collector performs sub- 
stantially better than a semi-space collector. The partially inlined 
barrier always achieves the best performance for both just-in-time 
(on average 9.9% better than inlining) and ahead-of-time compilation 
(on average 1.4% better than inlining). 

5.1 Slow Path Execution Frequency 

As we discussed in Section 3.1, a two generational copying collec- 
tor tracks pointers from the older to younger generation to avoid 
scanning and copying the older generation. Table 2 shows the rel- 
ative number of dynamic pointer stores for each application, in- 
cluding compilation. We use this configuration instead of just the 
application take rates, because the latter were even smaller, further 
exaggerating the results. 

We experiment with both a slot and object write barrier, and three 
different two generational collector configurations: an Appel-style 
collector, and two fixed-size nursery collectors with nurseries con- 
sisting of 5% and 10% of the usable heap. We measure the number 
of pointers stored over 8 heap sizes from 1× to 3.25× the minimum 
heap size as reported in Table 1, and report the geometric mean of 
these frequencies. The heap size affects the nursery size, and thus 
the frequencies. These results are typical [30], and demonstrate that 
these programs take the slow path less than 3% of the time, and in 
many configurations, much less than 1%. The next section shows 
we can exploit this dichotomy. 
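
To make the fast-path/slow-path split concrete, here is a minimal Python sketch of a slot-remembering barrier for a two-generation heap. It is an illustration only, not the Jikes RVM code. The integer address layout (`NURSERY_START`), the `Heap` class, and the counter names are all assumptions introduced for this example. The fast path is the boundary test performed on every pointer store; the slow path appends the slot to the remembered set.

```python
# Illustrative model only: addresses are plain integers, and everything at or
# above NURSERY_START is assumed to lie in the nursery (a hypothetical layout).
NURSERY_START = 1_000_000


def in_nursery(address):
    return address >= NURSERY_START


class Heap:
    def __init__(self):
        self.memory = {}            # slot address -> stored pointer
        self.remembered_slots = []  # slots the collector must revisit
        self.stores = 0
        self.slow_path_taken = 0

    def remember_slot(self, slot):
        # Slow path: rarely executed, so a compiler can keep it out of line.
        self.slow_path_taken += 1
        self.remembered_slots.append(slot)

    def write_pointer(self, slot, target):
        # Fast path: executed on every pointer store.
        self.stores += 1
        if not in_nursery(slot) and in_nursery(target):
            self.remember_slot(slot)
        self.memory[slot] = target

    def slow_path_rate(self):
        return self.slow_path_taken / self.stores if self.stores else 0.0


heap = Heap()
# 99 stores whose source slot is itself in the nursery (nothing to remember) ...
for i in range(99):
    heap.write_pointer(NURSERY_START + i, NURSERY_START + 500 + i)
# ... and one store that creates an old-to-young pointer.
heap.write_pointer(42, NURSERY_START + 7)
print(f"slow path taken on {heap.slow_path_rate():.2%} of stores")
```

In this toy run only one store in a hundred reaches `remember_slot`, which is the same order of magnitude as the measured take rates.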





                   Appel-style        5% Nursery         10% Nursery
Benchmark          Slot     Object    Slot     Object    Slot     Object
_201_compress      0.52%    0.47%     0.77%    0.71%     0.77%    0.71%
_202_jess          0.47%    0.56%     1.12%    1.72%     0.83%    1.03%
_205_raytrace      0.56%    0.56%     1.13%    1.13%     0.77%    1.13%
_209_db            0.16%    0.06%     0.47%    0.41%     0.27%    0.23%
_213_javac         0.58%    1.07%     1.07%    2.02%     1.07%    1.58%
_222_mpegaudio     0.58%    0.45%     1.67%    1.83%     1.19%    1.10%
_227_mtrt          0.38%    0.38%     0.93%    0.93%     0.64%    0.93%
_228_jack          2.47%    1.11%     1.11%    2.81%     1.11%    1.84%
Geometric mean     0.37%    0.29%     1.34%    1.35%     0.93%    0.94%



Table 2: Frequency with which the slow path is taken for three 
collector configurations and two write barriers. 



5.2 Application Code Quality 

We investigate three options for write barrier placement: a fully in- 
lined write barrier (inline), an out-of-line write barrier implemented 
with a direct method call (out), and a partially inlined write barrier 
with the fast path inlined and the slow path out-of-line (partial). 
This section presents results for the code quality of the application, 
as one would see in a traditional ahead-of-time compiler. In addi- 
tion to excluding the compilation time, we also exclude the garbage 
collection time because of the impact of compiler-generated heap 
objects on garbage collection. 

We measure application time by executing each benchmark twice 
using the optimizing compiler which does all its work on the first 
iteration. The application time is the total time for the second it- 
eration of the benchmark less any garbage collection time during 
that iteration. We measure each benchmark five times and record 
the best time. We run the benchmark using 8 different heap sizes 
from 1× to 3.25× the minimum heap size and report the geometric 
mean of the best times for each of these heap sizes (see Table 1 
for minimum heap size data). We also report the geometric mean 
of the application time results for all benchmarks, and normalize 
against the inlined barrier. 
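
The aggregation described above (best of several runs per heap size, geometric mean across heap sizes, normalization against the inlined barrier) can be sketched as follows. This is a hedged illustration, not the measurement harness used for the paper: the function names are hypothetical, and the timings are made up solely to exercise the code.

```python
import math


def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


def summarize(runs_by_heap_size):
    """Take the best (minimum) time per heap size, then the geometric mean."""
    best_times = [min(runs) for runs in runs_by_heap_size]
    return geometric_mean(best_times)


# Made-up timings in seconds: two heap sizes, three runs each (not paper data).
inline_runs = [[10.4, 10.0, 10.2], [9.1, 9.0, 9.3]]
partial_runs = [[9.9, 9.8, 10.1], [9.0, 8.9, 9.2]]

inline_time = summarize(inline_runs)
partial_time = summarize(partial_runs)

# Normalize against the inlined barrier, as in the tables that follow.
normalized = partial_time / inline_time
print(f"partial / inline = {normalized:.1%}")
```

A normalized value below 100% means the configuration ran faster than the fully inlined barrier.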
                   Slot Remembering Barrier      Object Remembering Barrier
Benchmark          Inline    Partial   Out       Inline    Partial   Out
_201_compress      100.0%    99.0%     99.2%     100.0%    99.8%     100.4%
_202_jess          100.0%    99.0%     113.7%    100.0%    100.1%    105.9%
_205_raytrace      100.0%    92.9%     97.6%     100.0%    99.5%     101.6%
_209_db            100.0%    99.2%     102.2%    100.0%    100.0%    102.4%
_213_javac         100.0%    97.4%     103.4%    100.0%    98.3%     102.0%
_222_mpegaudio     100.0%    99.4%     100.7%    100.0%    99.6%     100.6%
_227_mtrt          100.0%    101.1%    105.0%    100.0%    100.1%    101.2%
_228_jack          100.0%    99.9%     104.6%    100.0%    99.9%     101.7%
Geometric mean     100.0%    98.5%     103.2%    100.0%    99.7%     102.0%



Table 3: Average application running time (excluding GC), 
normalized against running time with an inlined barrier. 



Table 3 shows that partial inlining is the best choice for the slot 
barrier in all benchmarks, and the best for the object barrier on 5 of 
the benchmarks. The out-of-line barrier is the worst of the choices, 
on average 2% (the object barrier) and 3.2% (the slot barrier) worse 
than inlining, except for _201_compress and _202_jess. We were a 
little surprised the out-of-line barrier did not do worse, but this 
machine and compiler apparently implement a direct procedure call 
very well. On average, partial inlining offers an overall improvement 
over full inlining of about 1.5% for the slot barrier and 0.3% 
for the object barrier. 

This result is a little surprising. Common practice might suggest 
that fully inlined code will perform better. However, the slow path 
frequency numbers in Table 2, the dynamic pointer store statistics 
in Table 1, and the compilation statistics reported in Table 4 all 
indicate instruction locality as one explanation. The rarely exe- 
cuted write barrier slow path is proliferated throughout the com- 
piled code, increasing the code volume by around 30% over the 
out-of-line case (see Table 4). A second potential explanation is 
that increasing the register pressure by inlining the slow path de- 
grades code quality. Most likely a combination of these effects 
accounts for the performance degradation suffered by the fully in- 
lined code. 

A call to a static method in Jikes RVM takes a minimum of three 
instructions and typically more (see footnote 6). Thus the call alone increases the 
number of instructions inlined by at least 100% on top of the fast 
path instruction sequence in the case of partial inlining. The com- 
piler could reduce the mixing of hot fast-path and cold slow-path 
instructions in the partial inlining case by pushing the call sequence 
to the end of a code block [19]. We expect this optimization would 
further improve the performance of partial inlining, but do not ex- 
plore it here. 

For traditional ahead-of-time compilation, these results suggest 
that a partially inlined write barrier will attain consistent, but small 
execution time improvements for the slot barrier on a variety of 
benchmarks. 

5.3 Compile-time Costs 

Dynamically compiled languages like Java directly expose the ap- 
plication to the compiler. We now measure the compilation over- 
head for the three write barrier code placement strategies. In this 
section, we tease apart the compile time to optimize code with dif- 
ferent write barriers, and its effect on compilation time itself (be- 
cause the Jikes RVM compiler is subject to the choice of write bar- 
rier as well). 

Because the Jikes RVM optimizing compiler normally executes 
in the context of Jikes RVM, changing the write barrier changes 
the performance and allocation of the compiler itself as it executes, 
i.e., the compiler must execute the same barrier that it is compiling. 
Direct measurements of the optimizing compiler within the 
Jikes RVM are thus problematic because the code quality issues 
raised in Section 5.2 are superimposed onto any variation in compiler 
workload the different barriers impose. For these reasons, we 
measure the Jikes RVM optimizing compiler when compiling the 
Jikes RVM boot image in the context of a host JVM (Sun's HotSpot 
JVM). This strategy holds constant the execution context for the 
compiler, and only changes the compiler's results due to the write 
barrier. 

6 Note that in Figures 1c) and 2c), the static method calls take seven 
instructions (lines 11-17 and 13-19 respectively). 

The compilation of the boot image is a substantial test of the 
optimizing compiler. It compiles 9743 methods and 43004 write 
barriers. For the fully inlined slot barrier, it generates 11.4MB of 
instructions, and spends about 15 minutes in compilation. The results 
in Table 4 show the impact of the three code placement strategies 
on the compiler workload, as measured by instructions generated, 
and time taken to compile the Jikes RVM boot image classes. As 
with our other results, we report the best time from five runs. 





          Slot Remembering Barrier      Object Remembering Barrier
          Inline    Partial   Out       Inline    Partial   Out
Output    100.0%    73.8%     66.9%     100.0%    76.6%     69.4%
Time      100.0%    75.2%     62.2%     100.0%    79.1%     69.0%



Table 4: Compiler workload expressed in terms of instructions 
generated (Output) and compilation time (Time), for the slot 
and object barriers all normalized to inlining using the Jikes 
RVM on the HotSpot JVM. 

These results indicate that fully inlining write barriers substan- 
tially increases the compiler workload over the other choices. Par- 
tial inlining reduces total compiler costs by around 20% to 25%, 
and the out-of-line barrier reduces it more, by 30% to 35%. Note 
also that the improvements in compilation time are approximately 
linear in the reduction in instructions generated. This result means 
that the compilation work is proportional to the number of instruc- 
tions it generates. (Good news for the compiler!) For our study, 
the fundamental problem is that fully inlining the write barrier in- 
creases the compiler load significantly without any corresponding 
increase in the resulting code performance, and, in fact, slightly 
degrades code quality. 
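
The proportionality claim can be checked directly from Table 4 by dividing each normalized compile time by the corresponding normalized output size. The short Python sketch below does only that arithmetic; the dictionary layout and variable names are illustrative choices, and the numbers are copied from the table. If compile time were exactly proportional to instructions generated, every ratio would be 1.0.

```python
# Values copied from Table 4 (percent of the fully inlined barrier).
table4 = {
    ("slot", "partial"): {"output": 73.8, "time": 75.2},
    ("slot", "out"): {"output": 66.9, "time": 62.2},
    ("object", "partial"): {"output": 76.6, "time": 79.1},
    ("object", "out"): {"output": 69.4, "time": 69.0},
}

ratios = {}
for (barrier, placement), row in table4.items():
    ratio = row["time"] / row["output"]
    ratios[(barrier, placement)] = ratio
    print(f"{barrier:6s} {placement:7s} time/output = {ratio:.2f}")
```

All four ratios fall between roughly 0.93 and 1.03, which is consistent with the approximately linear relationship noted above.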

When compilation time exceeds about 10% of total time and be- 
comes more dominant, these results even suggest an out-of-line 
barrier as a good choice! As we show in the next section, this 
suggestion does not hold up in our system. Instead, the cost of 
executing the out-of-line barrier in the very pointer store intensive 
compiler degrades the compiler performance by more than the time 
to compile it. 
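
The rough 10% threshold can be approximated with a back-of-envelope model. The model is a simplification introduced here for illustration, not a computation taken from the measurements: it assumes total time is a weighted sum of compile time and application time, takes the out-of-line slot barrier's compile-time ratio from Table 4 and its application-time ratio from the geometric mean in Table 3, and ignores both garbage collection and the cost of the barrier inside the compiler itself. As the paragraph above explains, that last effect is what makes the suggestion fail in practice. The function names are hypothetical.

```python
def relative_total(compile_fraction, compile_ratio, app_ratio):
    """Total time relative to the inlined baseline (1.0 means no change)."""
    return compile_fraction * compile_ratio + (1.0 - compile_fraction) * app_ratio


def break_even_fraction(compile_ratio, app_ratio):
    """Compile fraction at which the alternative matches the inlined barrier.

    Solves f * compile_ratio + (1 - f) * app_ratio = 1 for f.
    """
    return (app_ratio - 1.0) / (app_ratio - compile_ratio)


# Slot barrier, out-of-line: Time from Table 4, geometric mean from Table 3.
OUT_COMPILE, OUT_APP = 0.622, 1.032

f_star = break_even_fraction(OUT_COMPILE, OUT_APP)
print(f"break-even compile fraction: {f_star:.1%}")
```

Under these assumptions the break-even point lands near 8% of total time spent compiling, in the neighborhood of the figure quoted above.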

5.4 Total Running Time 

Having shown that a partially inlined write barrier executes faster 
and substantially reduces compiler workload compared to a fully 
inlined barrier, and that the out-of-line barrier is more ambiguous 
with respect to these criteria, we now examine their overall exe- 
cution time impact. Because the total running time is so heavily 
impacted by the level of compiler activity, we show best and worst 
case results for compiler activity. In this context, best is no compi- 
lation (everything is already compiled) and worst is all application 
methods compiled at the time of their execution. 

Table 5 illustrates total running time results for slot and object 
barriers compiled inlined, partial, and out-of-line using the Appel- 
style generational collector. We include for comparison the running 
time for the simple semi-space collector, which has no write 
barrier. We again pick the best of 5 runs and compute the geometric 
                   Appel-style generational collector
                   Slot Remembering Barrier      Object Remembering Barrier
Benchmark          Inline    Partial   Out       Inline    Partial   Out       Semi-space
_201_compress      100.0%    97.0%     97.6%     99.4%     96.7%     99.6%     108.7%
_202_jess          100.0%    86.2%     93.6%     95.0%     88.6%     91.4%     207.8%
_205_raytrace      100.0%    88.5%     92.2%     87.5%     82.1%     86.4%     159.8%
_209_db            100.0%    98.4%     100.2%    99.8%     98.9%     101.4%    114.2%
_213_javac         100.0%    81.5%     84.5%     97.6%     83.0%     84.8%     119.9%
_222_mpegaudio     100.0%    90.4%     91.9%     97.3%     91.6%     93.0%     104.5%
_227_mtrt          100.0%    90.5%     96.2%     97.4%     93.7%     97.1%     138.1%
_228_jack          100.0%    87.0%     89.9%     95.7%     87.5%     89.9%     141.4%
Geometric mean     100.0%    89.8%     93.2%     96.1%     90.1%     92.8%     133.5%

a) Total Time including compilation (first iteration).

                   Appel-style generational collector
                   Slot Remembering Barrier      Object Remembering Barrier
Benchmark          Inline    Partial   Out       Inline    Partial   Out       Semi-space
_201_compress      100.0%    98.8%     99.1%     99.2%     98.9%     100.5%    113.0%
_202_jess          100.0%    98.9%     112.6%    97.6%     97.5%     105.4%    372.2%
_205_raytrace      100.0%    93.1%     97.7%     94.4%     93.9%     101.0%    238.4%
_209_db            100.0%    99.1%     102.1%    99.4%     99.3%     102.3%    116.5%
_213_javac         100.0%    96.5%     101.2%    101.1%    98.3%     100.2%    144.6%
_222_mpegaudio     100.0%    99.4%     100.7%    99.8%     99.4%     100.6%    99.1%
_227_mtrt          100.0%    100.8%    104.8%    101.2%    102.0%    101.5%    174.8%
_228_jack          100.0%    99.4%     103.8%    99.5%     98.9%     101.3%    204.6%
Geometric mean     100.0%    98.2%     102.6%    99.0%     98.5%     101.6%    166.8%

b) Total Time excluding compilation (second iteration). 



Table 5: Total running time for one iteration of each of the SPEC JVM98 benchmarks, including garbage collection costs. 



mean for each program and collector over 8 heap sizes. All of these 
numbers are inclusive of garbage collection time. All but two of the 
SPEC JVM98 programs spend between 20 and 40% of their time 
in the Jikes RVM compiler; _201_compress (9-13%) and _209_db 
(5-7%) are the exceptions. Since the semi-space collector has no 
barrier, the lowest compilation times are for the semi-space collec- 
tor (e.g., a compile time of 5% for _209_db with no write barrier). 
When compiling the Jikes RVM boot image, the full, partial, and 
out-of-line barriers slowed the compiler down by 71%, 30%, and 
10% respectively, relative to the semi-space collector. 

Table 5a) shows times for the first iteration of each benchmark 
and is thus inclusive of compilation costs. Since the optimizing 
compiler compiles all methods prior to their execution, this table 
represents the worst case in terms of compilation load. Compar- 
ing between inline, partial, and out-of-line, a partially inlined slot 
barrier performs best, on average 10.2% better than inlining. With 
the compiler in the picture, an out-of-line slot barrier is better than 
an inlined one. The inlined object barrier performs better than an 
inlined slot barrier, but slot is the best using partial inlining. 

Table 5b) shows times for the second iteration of each bench- 
mark and is thus exclusive of compilation costs. This table repre- 
sents no compilation load, but does include the garbage collection 
time. In these results, inlining always performs better than an out- 
of-line barrier. Partial inlining offers a small and consistent advan- 
tage over full inlining. These results are consistent with the code 
quality improvements observed in Section 5.2 where we exclude 
garbage collection performance. 

With compilation, even an out-of-line barrier is better than an 
inlined barrier for the slot barrier by 6.8% on average, and for 
the object barrier, they differ only by around 3.5% on average. 
When compilation is eliminated or minimized, the out-of-line bar- 
rier takes its place as the worst performing barrier. 

Regardless of the compilation cost or the barrier code placement 
choice, a semi-space collector with no barrier performs worse than 
Appel with a barrier. As expected, the incrementality of the Ap- 
pel generational collector always yields significantly better perfor- 



mance than collecting the whole heap every time with the semi- 
space collector. For example, an inlined slot barrier with Appel 
is 33% faster than the semi-space collector with the compiler and 
66% without the compiler. In fact, these results are somewhat 
understated because the second iteration will have lower memory 
requirements than the first because the heap will not contain any 
compiler objects. In some cases, the semi-space collector is barely 
exercised in the second iteration (e.g., _222_mpegaudio and _209_db). 

These results indicate that the impact of a partially inlined slot 
write barrier ranges from a 0.8% degradation to 18.5% improve- 
ment over a fully inlined write barrier, depending on the level of 
compilation. The more compilation, the greater the advantage par- 
tial inlining has over inlining. We also measure the activity of the 
adaptive compiler which is typically 12.5% of execution time. This 
level of activity is between half and one third that of the optimizing 
compiler, and will thus see the benefit of partial inlining. 

We obtain very similar results using the same experiment with a 
fixed-size nursery collector which, as we show, takes the slow path 
more frequently. Thus, we believe these results will hold across 
different collectors and compilers. 

6. Conclusion 

The write barrier is a key to the efficiency of many modern garbage 
collectors. Garbage collectors pay mutator write-barrier overheads 
to reduce copying overhead and improve total performance. Many 
researchers have spent their time trying to minimize the impact of 
the write barrier on the mutator. We show that the way in which 
the barrier is compiled can have a considerable impact on overall 
performance, even if it is highly optimized. Write barriers are pro- 
lific and have highly regular bimodal execution patterns. These two 
characteristics bring into question a common practice of inlining 
write barriers. 

We find that fully inlining write barriers not only produces sub- 
optimal code, but dramatically increases the compiler's workload. 
By contrast, partial inlining reduces the compiler's workload by be- 
tween 20% and 25% as compared to full inlining, and consistently 
leads to better quality code. This result is general in the context 
of write barrier compilation. Furthermore, it is likely to extend to 
other contexts, notably compiling allocation sequences which are 
also prolific and typically have well defined fast and slow paths. 